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Abstract 


The U.S. Bureau of Labor Statistics use monthly, by-state employment totals from the Current 
Population Survey (CPS) as a key input to develop employment estimates for counties within 
the states. The monthly CPS by-state totals, however, express high levels of volatility that 
compromise the accuracy of resulting estimates composed for the counties. Typically-employed 
models for small area estimation produce de-noised, state-level employment estimates by bor¬ 
rowing information over the survey months, but assume independence among the collection of 
by-state time series, which is typically violated due to similarities in their underlying economies. 
We construct Gaussian process and Gaussian Markov random field alternative functional prior 
specifications, each in a mixture of multivariate Gaussian distributions with a Dirichlet process 
(DP) mixing measure over the parameters of their covariance or precision matrices. Our DP 
mixture of functions models allow the data to simultaneously estimate a dependence among 
the months and between states. A feature of our models is that those functions assigned to 
the same cluster are drawn from a distribution with the same covariance parameters, so that 
they are similar, but don’t have to be identical. We compare the performances of our two 
alternatives on synthetic data and apply them to recover de-noised, by-state CPS employment 
totals for data from 2000 — 2013. 
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1. Introduction 


The Current Population Survey (CPS), a household survey administered by the Census Bu¬ 
reau for the Bureau of Labor Statistics (BLS), publishes time-indexed, state-level estimates 
of variables relating to employment, such as the employment total. The time-indexed em¬ 
ployment estimates express a high degree of volatility, so that BLS will apply independently 
state-specified regression models to reduce the volatility before proportioning these estimates 
to labor market area or county as part of their local area unemployment survey (LAUS) pro¬ 
gram. This approach assumes independence among the state-indexed collection of time series. 
State administrators frequently express concern that the resulting county-level estimates com¬ 
posed from the state-level regression models are overly volatile, in that they believe much of 
the month-over-month movement is noise, not signal that they would be able to attribute to 
underlying economic drivers. So the LAUS program seeks a more accurate means to extract 
de-noised, state-level estimates of employment totals such that state administrators are readily 
able to ascribe economic drivers to the resulting county-level employment trends. 

Bayesian nonparametric alternatives, such as the Gaussian process construction imple¬ 


mented by Vanhatalo et al. (2014), treat the collection of time series as sums of latent func¬ 


tions and a noise process. The structure of the model is focused to the latent functions, which 
are generated from a multivariate Gaussian distribution under covariance constructions that 


regulate the smoothness properties of the resulting functions (Rasmusen and Williams, 2006). 


Vanhatalo et al. (2014) specify their Gaussian process prior as independent over the states. 


Dependence among a set of noisy functions over the states using a Gaussian process prior 
formulation would require vectorizing the functions into a single function or vector of length 
equal to the number of states multiplied by the number of time-indexed measurement waves. 
This approach would be computationally prohibitive were one to sample the full joint pos- 
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terior distribution as computation time scales with order equal of the cube of the resulting 
vector length. The rendering of sets of functions into a single vector is also not a natural way 
to perform inference because the ordering of the states may notably impact both global and 
local smoothness properties of the resulting vector and this approach assumes a continuous 
transition between state functions. 


We offer an approach that builds on the independent nonparametric Bayesian functional 
models by probabilistically tying the set of latent functions together under a marginal non¬ 
parametric mixture prior, where the mixing measure receives a Dirichlet process (DP) prior 
(Ferguson |1973 ). Our modeling framework is similar to Gelfand et al. (2005), but their 
approach imposes the Dirichlet process prior directly on the functions, so that co-clustered 
functions must be exactly the same, which we find to be too restrictive for working with our 
CPS employment data application. We will, instead, impose the DP prior on the covariance 
matrix parameters of the generating Gaussian distribution, such that state functions which are 
co-clustered are drawn from a Gaussian distribution with the same parameters, but are not 
required to be exactly equal. 


We formulate our nonparametric mixture approach using two alternative prior construc¬ 
tions to parametrize the covariance matrix of a multivariate Gaussian prior imposed on the 
set of latent functions; 1. A Gaussian process prior; 2. An intrinsic Gaussian Markov random 
field prior. These alternative formulations are developed and their properties contrasted in 
Section [2j Schemes to sample the posterior distributions under each of our two functional 
prior alternatives are sketched in Section [3j where we comment on their computational proper¬ 
ties. We compare their performance, both in fitting the functions and uncovering a clustering 
structure used to generate the functions in Section [4j Our nonparametric mixture models are 
applied for estimation of state-level employment totals obtained from the CPS in Section [5j 
followed by a concluding discussion in Section [6j 
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2. Models for Functional Data 

2.1 CPS Employment Count Data 

Our data are composed for of T = 158 months of direct estimates of employment totals for 
each of N = 51 states (including the District of Columbia) published by the BLS in the 
Current Population Survey. Our modeling interest is focused on reducing estimation volatility 
to improve the precision of the monthly, state-level employment totals. The employment totals 
are influenced by the underlying economic conditions of the states. We, therefore, expect 
a correlation of trends in the mix of industrial, service and agricultural economic activities 
among states, on the one hand, with those expressed in their employment total time series, 
on the other hand. State policy makers seek context around their reported employment and 
unemployment levels in order to understand the drivers for the estimated trends. We will 
later demonstrate that performing inference on clusters of distributions generating the set of 
state-indexed functions provides such context. 

We denote the set of (T = 158) x 1 vectors of survey direct estimates for a collection 
of N = 51 states with {y?:}i=i,...,Ar. We would like to estimate de-noised, smooth functions, 
{f;}, from the {yi}. We estimate the dependence among the collection of state employment 
totals through a prior construction that permits a grouping or clustering of covariance (under a 
Gaussian process (GP) prior specification) or precision parameters (under an intrinsic Gaussian 
Markov random held (iGMRF) prior specification), that we index by state. These state-indexed 
covariance or precision parameters are used to generate the latent {f*}. 

We first outline GP and iGMRF formulations with simpler, global covariance and precision 
parameters, respectively, (0,k), not indexed by state, to illustrate the properties of functions 
generated under these prior specifications. Employing a single set of covariance or precision 
parameters assumes the functions are exchangeably drawn. We will subsequently extend both 
of these models using nonparametric mixture formulations over the states for {R,} and {/«*} 
under the GP and iGMRF models, respectively, to allow for dependence among the functions. 
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2.2 Gaussian process Model 


Our Gaussian process (GP) model is parameterized through the covariance matrix, C (0), 
where 0 are estimated by the data and control the trend, length scale (or wavelength) and 
variation in generated functions, {£;}. We specify a GP formulation in the following probability 
model: 


Txl ind 


( 0, C (0) 




TxT 


01 ,..., 8p\a, b ~ Qa (a = 1.0, b = 1.0) 


iid 


r e |c, d ~ Qa(c = 1.0, d = 1.0) 


(la) 

(lb) 

(lc) 

(l d) 


where 0 = (0i,..., dp). The GP model specifies a formula for the covariance matrix, C (0), 
with, 


GW = (CfpfJ 


i/6(l,...,T) 


2 \ 


r< = 1 , , ML, 

fjJi 0 ! \ 0203 ) ’ 

for P = 3. The particular covariance formula we use is known as the rational quadratic, 
which may be derived as a scale mixture of more commonly-used squared exponential kernels, 
1/01 exp {{tj — tf) 2 /0), with the inverse of the length scale parameter, 0 -1 , which controls 
function periodicity, distributed under a Beta distribution with hyperparameters ( 03 , 0^ 1 ) 


(Rasmusen and Williams, 2006). The vertical magnitude of a function drawn under the rational 


quadratic covariance formula is directly controlled by 0i, while 02 controls the mean length 
scale or period of the function, and 03 controls smooth deviations from 02. As 03 f oo, this 
formulation converges to a single squared exponential kernel with length-scale, 02- Figure [l] 
displays a randomly-generated function under each of 3 distinct value settings for for 0 = 
( 01 , 02 , 03 ) of the rational quadratic covariance formula to provide insight into how 0 produces 
distinct features in f. Our choice of the rational quadratic covariance formulation is intended 
as a parsimonious specification for parameterizing the use of a single covariance matrix that 
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is able to discover multiple length-scales or wavelengths in a function, in lieu of using a set 
of squared exponential covariance terms (e.g. in an additive formulation) with each term 
specialized to a particular length scale. 

The rational quadratic formula also produces smooth surfaces, {f,;}, because they are con¬ 
strained to be differentiable at all orders. We prefer covariance formulas that render smooth 
functions, {£;}, because the smoothness property more identifiably separates signal from rough, 
non-differentiable, noise. A primary goal for this work is to extract de-noised functions from 
noisy time series. The GP of Equation [l] under the rational quadratic covariance formula is 
very flexible, allowing the estimation of functions, {fj}, to be non-linear of any order within the 
space of smooth functions. This GP is equivalent to an infinite dimension polynomial spline, 
even though the parameterization of the GP is through the covariance matrix, rather than the 
mean. A linear mean model may be re-parameterized as a zero mean GP by marginalizing over 
the regression coefficients, which demonstrates that linear surfaces are a subset of the space of 
smooth functions parameterized in Equation [l] (Neal 2000a). The structure of Equation [I] is 
contained in f), so that the {y^} are assumed to conditionally-independent, given {£;}. Stan¬ 
dard errors for the monthly CPS direct estimates were not available to us. We treat the model 
noise precision, r e , as unknown, such that it is estimated by the data, because this parameter 
influences the amount of signal, represented by f, extracted from the noisy response, y, and 
is a feature of the GP model. Our probability model in Equation [lj that specifies a Gamma 
prior for r e , is fully identified. 


2.3 Intrinsic Gaussian Markov Random Field Model 

An iGMRF prior may be viewed as a probabilistic local smoother composed from differences 
in function values, which in our case, are indexed by time. A typical iGMRF specification 
is placed on the first difference approximation to the first derivative, A/y = fi.j+\ — fi.j ~ 
A/"(0,k^ 1 ), where k is the earlier discussed precision parameter that determines the (ver¬ 
tical) scale of variation among the first differences. This approach uses nearest neighbors 
defined from first differences, fi,j+i), to encode the time-indexed dependence struc- 
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Figure 1: Randomly-drawn functions, f, from a GP using a rational quadratic covariance 
function under distinct values for 6 = ($1,6*2, #3). The top panel employs 6 = (0.4,0.2, 5.0) 
to generate a “rough” surface, while the bottom panel specifies (0.4,4.0, 5.0) to characterize 
a “smooth” surface, and the middle panel employs (0.4, 0.2, 5.0) to produce a rough surface 
characterized by a high-degree of length-scale variation. 


ture among the fij , j = 1 ,...,T for each domain, i £ (1 Using first differences 

may produce an excessively rough (though continuous) surface that over-fits the data (by 
making it difficult to separately model signal and noise processes), so we employ a prior 
construction based on the second difference approximation to the first derivative, A 2 fij = 
A (A fij) ~ A f (0, k _ 1 ). This prior on second differences produces the joint distribution, f) ~ 
exp J2j=i (.fij - 2/ij+i + fi t j +2 )) = exp (-^f-Rfi), where R = kQ specifies 
a band-diagonal precision matrix with non-zero entries, (Qj j'_2, Qj,j-i ; Qj,j ■ Qj,j+\■> Qj,j+ r 2 ) = 


(1, —4, 6, —4,1) for 2 < j < T — 2 for non-boundary parameters (Rue and Held, 2005). Under 
this construction, R is rank-deficient (of rank T — 2) as the rows sum to 0 since it is composed 
from second differences, so that the joint distribution is improper; in particular, the prior for the 
T x 1, fj, is invariant to the addition of any second order polynomial because the prior supplies 
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no information about such polynomials. We may view the joint distribution as the product of a 
proper distribution on the space of T — 2 differences (by employing the Moore-Penrose pseudo 
inverse, R~, and |R| as the product of the T — 2 non-zero eigenvalues of Q) and an improper, 
noninformative prior on the order 2 polynomials. These Gaussian Markov random field priors 
specified through a precision matrix have the property that f t j X fik\f.-jk ^ Rjk = 0 , which 
allows for a parsimonious Q, from which we specify a proper set of full conditional distributions 
that we use for posterior sampling, 


fij | fj ,—j i k ~ N 


j Q 'y ] Qjkfikj (kQjj) 

y k:k^j 


= AT 


/ 4 

( q 1 + fid- 1 ) 


1 

6 


(fij +2 + fid- 2 ), ( 6 k) 



( 2 a) 

( 2 b) 


where {A: : k ~ j} denotes the set of neighboring time points, {k}, of time, j. The prior mean 


for each /$,■ is composed as a weighted average of its order 2 nearest neighbors. Rue and Held 


(2005) refer to this construction as a random walk prior of order 2 or RW2(k). One may 
equivalently parameterize, R = k (D — pft), for diagonal weight matrix, D, with (diagonal) 
entries equal to those of Q. Element, j, of D, counts the number of neighbors of time, j, 
where the function value at a time point with more neighbors will have a relatively higher 
precision. Adjacency matrix, ft, is specified with OX on the diagonal and off-diagonal elements 
are filled with the negative of values from the off-diagonal elements of Q and encode adjacency 
relationships among the time points (Banerjee et al. 2003). Under this parameterization, 
— 1 < p < 1, is the autocorrelation parameter that shrinks the mean to 0, but produces a 
proper joint prior for £). The RW2(k) prior construction, however, sets p = 1. As with 
Equation[lJ we specify r e ~ Qa (c = 1.0, d = 1.0). 


The GP construction of Equation [I] includes parameters ( 62 , @ 3 ) for the length-scale that are 
estimated from the data, while the iGMRF prior hard codes the length scale in Q, suggesting 
more estimation flexibility for the GP, which we assess in Section [4} 
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2.4 Accounting for Dependence Among Functions 

We introduce an extension of Equation[lJ which indexes the GP covariance function parameters, 
{0.,:} = {{On, • • •, Oip)}, by state, i € (1,..., N ), to permit their probabilistic clustering with, 

fi ~ A f T (0, C (0i)) (3a) 

9 u ...,0n\ (3b) 
G ~ DP(a, G 0 ), (3c) 


where ... jv receive a random distribution prior, G , drawn from a Dirichlet process (DP), 

specified with a concentration parameter, a, a precision parameter that controls the amount 
of variation in G around prior mean, Go- The base or mean distribution, Go, is typically 
constructed as parametric; in our case, Go = @ a (1-0,1.0), which we used under the 

global GP model of Equation [l] parameterizing a single vector, 0, for all states. Equation [3] 
describes a mixture model of the form, f|G ~ f Mp (0, C (0)) G (d0), where G is the mixing 
measure over the P x 1 covariance parameters, 0. 

We examine the clustering property of the DP by expressing it in the discrete, stick breaking 


form of Sethuraman (1994), 


G=y^ Pk s e; 


( 4 ) 


h= 1 


a countably infinite mixture of weighted point masses, where “locations”, 0 }, ..., 0 * M , index 
the unique values for the {0;}, where M < N (states) that we interpret as clusters. The 
maximum number of clusters assigns each state to its own cluster, which countably increases 
with N. We record state cluster memberships with s = (si,...,sjv) where Si = i denotes 
0* = 0 * so that (s, {0m}) provides an equivalent parameterization to {0j} and we recover 0* = 
0* . We conduct posterior sampling with the cluster locations and assignments, rather than 
directly sampling { 0 *}, as the former produces notably better mixing because it separates re¬ 
assignments to clusters from updates to the values. The weight, ph £ (0,1), is composed as ph = 
Vh Utl (1 ~ v k) where Vh is drawn from the beta distribution, Be(l,a). This construction 
provides a prior penalty on the number of mixture components. A higher value for a , however, 
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will generate smaller values for {vh}, and hence, “breaks” off more clusters (with unique 
locations), {ph}, from the unit stick. We place a further prior, a ~ Q (1,1), to allow posterior 
updating in recognition of the relatively strong influence it conveys on the number of clusters 
formed (Escobar and West, 1995). 

The DP construction assumes exchangeability of the (#*), a priori , given random measure, 
G, but the almost surely discrete construction of the DP produces estimates which are not 
exchangeable, a posteriori. The prior specification for cluster assignments, (sj|s_j), (under 
our re-parameterization to { s ? (@m)m=l,...,M} achieved by marginalizing over G ), induces a 
uniform probability for co-clustering among the states. Let C T (0^) = C (0j^) + (1 /t e )It 


after integrating out f) from Equation la that produces the marginal likelihood, y,;|s, ( 9* n ) ~ 
A/"t (0, C T (0*.)). If the likelihood values for two states, j and k , A/r (yj|0, C r (0*)) and 
A/’t (yfc|0, C T (0*)), are both relatively high for assignment to cluster, Sj = Sk = s, due to 
underlying similarities in their economies or due to other factors related to their closeness in 
their geographic locations, then the posterior probability for co-clustering states j and k will 
be relatively high (versus uniform, a priori). This posterior estimation mechanism conducts 
unsupervised (probabilistic) clustering. Sharper (lower posterior variance) estimates over the 
space of clusterings may be obtained, particularly when the number of observations are higher 
(than our N = 51), by indexing either the weights or locations in Equation [4] to include 
predictors in lieu of our unsupervised formulation. 

An analogous extension is specified from Equation [2] with, 


fij\^i-ji { K i}i=l,...W ~ A^ j ^ ) Qjkfikj{^iQjj) 

y ^ k:k^j 

G ~ DP(a, G 0 ), 


(5a) 

(5b) 

(5c) 


that may be expressed as G ~ f AT J2k-.k~j Qjkfik, ( nQjj ) G ( dn ), marginal¬ 
izing over We specify base distribution, Go = Qa (1,0.1), (which generates lo¬ 

cations, {k^}) as weakly informative with a large variance. We select these hyperparameter 
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settings because they produce a larger prior variance for k in Equation [5] than for 0 under 
Equation [3j The parameterization of the precision matrix of each function using only a single 
parameter (k) makes the posterior sensitive to the prior specification. As with Equation |3j we 
sample k indirectly through cluster assignments, s, for the IV x 1 states, and location values 


The stick breaking formulation illustrated in Equation [4] to induce the unknown measure, 
G, may be easily generalized beyond the DP by varying the form for the Beta prior assigned to 


Vh ; for example, the Poisson-Dirichlet (PD) alternative of Pitman and Yor (1997) includes an 


additional “discount” parameter, 5, along with the DP concentration parameter, a , to specify 
the hyperparameters of the Beta distribution for Vh that tends to generate more locations, a 
priori , inducing a less informative prior for the number of clusters (which is random under DP 
and PD constructions). Large clusters, however, receive more prior weight under the PD than 
for the DP. The DP may be extracted as a special case of the PD and both may be located 
within a larger class of generalized gamma processes (Lijoi et al., 2007). The PD and DP turn 
out to produce nearly identical posterior results under our data application due to a high level 
of information in the data, so we focused our exposition on the simpler model. 


3. Computation 

The mixtures of Gaussian processes formulation of Equation [3] is far more computationally- 
intensive than than the mixtures of iGMRFs model presented in Equation[5]because computing 
the cholesky decomposition of the T x T covariance matrix is C(T 3 ). We report computation 
time comparisons in Section [4] where we will, however, also demonstrate that the mixtures 
of GPs is more robust in discovering the correct number of generating clusters than is the 
mixtures of iGMRFs. So we are motivated to mitigate the computational burden associated 
to drawing posterior samples under a GP prior on the functions. A typically-used approach to 
improve computation for a GP probability model employs a sparse approximation to the GP 


that utilizes some subset of the time points as “inducing” inputs (Vanhatalo et al. 2014). We 
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expect functions, {fj}, for our CPS data to each express multiple length scales because our 
estimation period encompasses over one hundred months, such that a sparse approximation 
may fail to capture key features. So we adapt a recently developed posterior sampling algorithm 


of Wang and Neal (2013) to our mixtures of GPs that generates a sequence of proposals for 


locations, {0* n }, in a lower dimensional space using a subset of the T time points to build 
the GP covariance matrix, C, which reduces computation time. Yet, sample draws under this 
method will be from the exact posterior distribution, rather than from a sparse approximation. 

Please see Appendix [A] for an exposition of the posterior sampling algorithms used for both 
the mixtures of GPs and iGMRFs models, which we have implemented in the growfunctions 


package for R Core Team (2014), using C++ to speed computation. The package is fully 


documented and available on CRAN for downloading and includes the data used in this paper. 


4. Simulation Study 

We conduct a simulation study with the goals to: 1). Assess and compare the accuracy of 
estimating de-noised functions between our GP and iGMRF formulations; 2). Assess and com¬ 
pare the accuracy of detecting clusters or sub-groups of de-noised functions. We generate each 
latent synthetic function, f (from which we render the observed time series, y) under different 
parameterizations than either of our models as a means to assess the relative adaptability and 
robustness of our estimation models to real-world data generating processes. 

Our first simulation focuses on generating relatively complicated surfaces from a mixture of 
K = 2 length scales specific to each of M = 3 clusters. We randomly (and disjointly) allocate 
N = 100 domains to M = 3 clusters for T = 158 time points, and generate de-noised functions, 
{fi}, in each cluster using the addition of two squared-exponential terms, 


w T (o,c(«;j + c («!,.,)) 
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The covariance formula used to generate each cell of the covariance matrix associated to each 
squared exponential term is given by, 


C fijJu ( e k , Si ) - 


{tij tii) 

O* 

V U . O 


where (j, £) is the cell indicator in the T x T, C. Term indicator, k G {1,2}, denotes the 
covariance term with associated locations, {#£ m i> @k m 2 }’ f° r a squared exponential covariance 
(which has two location parameters). Label (k,m, 1) indicates a location parameter that 
controls the vertical scale for covariance term k in cluster m, whereas (k, m, 2) denotes a 
location parameter that controls the length scale or degree of smoothness. We recall that the 
vector, s = (.si,..., sat) , s* G (1,..., M), assigns domains to clusters, so that the parameter 
controlling the vertical scale in covariance term, k = 1, for domain i is 0\ s . l . The columns of 
the matrix, 


e * m = 1 2 3 

2.61 0.38 0.91 

01 m 2 3.00 3.53 1.56 

0 *= x ’ m ’ 2 

1.04 2.26 0.84 

[ 0.22 0.15 0.71 J 

specify the set of 4 locations generated for each of the M = 3 clusters that we used to construct 
{f;}. Examining the length scale location parameters, 9*. 2 , in this table reveals that the 
surfaces in each cluster are formulated from the sum of a long length-scale or smooth term and 
a short length-scale or rough term. So each surface is drawn from a mixture of two scales. The 
long length scale locations were generated from, 9*. 2 ~ Qa{ 3,2), while the short length-scale 
locations were drawn from, 9*. 2 ~ Qa (2, 5). The vertical scale parameters for both terms were 
drawn from, 9*, 1 ~ Qa (3,3). 

The resulting observed surface for domain i, y* ~ Mt (£;, TT^t) , where the global noise 
precision parameter, r e , is set to produce a 20% noise-to-signal ratio (based on the average 
variance of the {f}}). 
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The first of our two primary inferential goals is to uncover the de-noised latent functions, 
{fj}, so we want to avoid over-fitting the observed noisy surfaces. We perform estimation after 
randomly setting 10% of the NT observations to missing as presented to the estimation models 
to use them as a test set. The GP and iGMRF models will estimate function values, {fij}, for 
each of these missing values and we compute a mean-squared prediction error (MSPE) based on 
the squared difference between the estimated and true function values for the test set. We then 
normalize the MSPE by the variance of the test set latent functions to produce a normalized 
MSPE. Figure [ 2 ] compares fitted results (solid lines) to data values (circles) for the iGMRF 
RW 2(k) model, in the left column, with the GP formulation under a single rational quadratic 
covariance formula, in the right column. The rows represent each of the M = 3 clusters and 
the plot in each cell is for a randomly-selected domain from the represented model-cluster 
combination. We see that the GP model tends to attenuate the peaks to a greater degree than 
does the iGMRF. The normalized MSPE values are 0.39 for the GP and 0.54 for the iGMRF. 
The GP fits notably better precisely because it avoids over-fitting. 

Our second inferential goal is to utilize the clustering to provide context about how the 
functions compare among the observed domains (e.g. geographic or industry labels). Both 
the GP and iGMRF models employ a DP mixture formulation under which a clustering is 
generated on each posterior sampling iteration that assigns the covariance or precision param¬ 
eters, respectively, to clusters. Both the number of clusters, M, and the allocations of domains 
to those clusters, may change on each posterior sampling iteration, so that these samples, 
taken together, formulate draws from the posterior distribution over the space of clusterings 
(or partitions). The relative concentration of this distribution may be assessed by examining 
the degree of similarity in number of clusters and assignment of the domains to those clusters 
among the posterior draws. 


We select one clustering of domains from among the MCMC draws by using the least- 


squares algorithm of Dahl (2006) that builds an IV x IV matrix of pairwise clustering proba¬ 


bilities to summarize the posterior draws and selects that clustering which is “closest” to this 
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Figure 2: Posterior mean values for estimated functions, {£)}, from the iGMRF model, in 
the left-hand column, and under the GP, in the right-hand column under a simulated data set 
generated from a GP mixture of two squared exponential covariance terms. The rows represent 
each of the M = 3 clusters and each plot includes a randomly-selected domain. 


matrix under a squared Euclidean distance metric. Once we have selected a clustering, we 
build an N x N pairwise matrix, where cell (i-j) is filled with a 1 if states i and j are clustered 
together; otherwise, it is filled with a 0. We also build this pairwise clustering matrix for the 
true clustering used to generate the observed (synthetic) values. We then compare the true 
and model-based pairwise matrices and compute a percent mis-clustering based on the sum of 
the cell values that don’t agree. 

Figure [3] presents a visual schematic of the true clustering for our synthetic data and the 
estimated clustering structure under each of the two models. Each plot panel (in each of three 
groupings of panels) collects the posterior mean values for estimated functions assigned to the 
cluster for that panel. Each grouping of panels represents a clustering under a different model. 
The top row presents the true clustering under 3 clusters used to generate the observed {y*}, 
while the second row presents the estimated clustering for the GP model (selected under the 
least squares algorithm) and the last two rows present the clustering for the iGMRF model. 


15 
















Savitsky 


We see that the GP model well reproduces the generating cluster with a mis-clustering error 
of only 2%. The iGMRF model discovers the 3 clusters, but also creates echoes of the third 
cluster, producing a total of 5 clusters and a resulting mis-clustering error of 20%. 
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Figure 3: Cluster sub-groupings of the posterior means for estimated functions, {f}. Each 
panel plots the subset of functions assigned to a cluster under a simulated data set generated 
from a GP mixture of two squared exponential covariance terms. A loess smoother is added in 
each panel with a red dotted line to highlight patterns among the assigned functions. Each row 
represents the clustering or partition under a model. The first row presents the true functions 
and clustering, while the second row presents the estimated clustering from GP model and the 
last two rows, the GMRF model. 



Perhaps we expect the GP model with a rational quadratic covariance formulation to 
perform relatively better than the iGMRF model since the rational quadratic is designed for 
estimation under varying scales. So our next simulated procedure generates latent functions 
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from a model closer to the iGMRF. Recall that that the T x T precision matrix of GMRF 
may be expressed as, Rj = k*. (D — pfi), where we set the strength of correlation parameter, 
p = 1 (which produces a rank-deficient precision matrix), to produce our iGMRF formulation. 
If we set 0 < p < 1, we recover a full rank precision matrix, but at the expense of shrinking 
the conditional mean values to 0. We draw, 


A/'t (0,k* (D -0.95ft)) 


to construct our next synthetic dataset, under a second order specification for {D, il}, as de¬ 


scribed in Section 2.3 using the same N, T, M as for the first simulation. Locations, 
are generated from a Qa (1,1). As before, we set r e to produce a 20% noise-to-signal ratio. The 
resulting functions will be similar to those that would be generated under an iGMRF. Figure [4] 
demonstrates that the GP formulation (under the rational quadratic covariance formulation) 
tends to produce a smoother fit, making it easier to separate signal from noise. The fit perfor¬ 
mances are very similar between the GP and iGMRF models, with normalized MSPE values of 
0.157 and 0.159, respectively. Even though both models deliver similar fit accuracy, Figure [5] 
reveals that the GP formulation continues to do a better job of discovering the true clustering. 
Again, both models discover the framework of the true clustering, though the iGMRF model 
tends to echo one of the clusters. The mis-clustering rates here are 0% and 12% for the GP 
and iGMRF models, respectively, indicating that they both do a good job, but the GP model 
does well even when the underlying true functions are generated from a different process than a 
GP. The GP employs 3 parameters in it’s covariance formula, giving it the flexibility to detect 
both the large vertical scale differences that differentiate the clusters and to model length scale 
variations within a function. The iGMRF does better here than under the first simulation 
precisely because the clusters are primarily differentiated by (vertical) scale. 

We conclude our simulation study by estimating the GP formulation of Equation [l] where 
we exclude the modeling of clusters among {£;}. We seek to examine the out-of-sample fit 
performance when we ignore the dependence of the time-indexed collection of functions among 
the domains. We compare MSPE fit performance between including and excluding the clus- 
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Figure 4: Posterior mean values for estimated functions, {f*}, from the iGMRF model, in the 
left-hand column, and under the GP, in the right-hand column under a simulated data set 
generated from a proper GMRF of order 2 with strength-of-correlation, p = 0.95. The rows 
represent each of the M = 3 clusters with a randomly-selected domain. 


tering prior on the second simulated dataset. The resulting normalized MSPE of 0.3 under 
the model that excludes the clustering prior represents a substantial deterioration of fit from 
results under both the GP and GMRF constructions of Equations [3] and [5] which model the 
dependence among the domains. So ignoring the dependence among domains (e.g. states) may 
reduce the accuracy of estimating the latent, de-noised functions, {f)}. 

While the GP model demonstrates superior robustness, it comes at a computational cost. 
Under our implementation of tempered transitions in a temporary space for sampling GP 
locations, we roughly achieve the same information (effective sample size) in 10000 iterations 
of the GP model as we do in 25000 iterations of the iGMRF, though former consumes 28900 
CPU-seconds and the later only 1226 CPU-seconds on a single core of an Intel-i7-3450-powered 
laptop where the estimation routines are both written in C++. The successively coarser 
covariance approximations for likelihood evaluations in the temporary space were computed 
with (100,60) of the T = 158 time points. It is likely we could better if we were to optimize 
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Figure 5: Cluster sub-groupings of the posterior means for estimated functions, {f}. The 
estimation was performed on a simulated data drawn from a proper GMRF of order 2 with 
strength-of-correlation, p = 0.95. Each row represents the clustering or partition under an 
estimation model. The first row presents the true functions and clustering, while the second 
row presents the GP model and third, the GMRF model. 


these choices (from the standpoint of computation time per effective sample size), though the 
large difference will remain. 

5. CPS Employment Application 

We return to our motivating application for which we focus on reducing the noise-induced 
volatility in the T = 158 month series of employment totals for N = 51 states (including 
the District of Columbia) reported from the CPS. Each state’s observed employment series is 
standardized to remove the overall magnitude in order to facilitate comparisons across states 
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based on similarities in the shapes and patterns expressed. States seek context for changes 
in employment totals as they filter down to the county-level estimates, so we will perform 
inference on the distribution over clusterings to understand which states express similarities in 
their employment patterns. 


We fit the GP mixture model of Equation [3] to the CPS data. Figure [ 6 ] presents the 
allocation of estimated latent functions, {f)}, into assigned clusters of the clustering selected 


under the least squares criterion of Dahl (2006). We recall that the least squares algorithm 
selects a single clustering from among the set of posterior samples for the clustering. Each 
panel of the figure displays the estimated functions for those states assigned to each of 2 clusters 
estimated by the model. Examining the pattern in the collection of functions assigned to each 
cluster reveals that the clusters are primarily differentiated by the employment sensitivities 
expressed by member states to the “great recession” that began in approximately 2008 (at 
month 96). The latent functions for those states included in the left-hand panel express a 
notably longer lasting trough than those states included in the right-hand panel (where panels 
index cluster memberships). The initial rate of employment drop appears similar in both 
clusters, but the decline continues longer and recovers more slowly among states assigned to 
the left-hand cluster. We conducted additional runs that varied the shape hyperparameter of 
the DP concentration prior on a ~ Qa ( 04 ,1), with 04 6 {1 — 4} to induce successively higher 
prior number of clusters, but the estimated models all performed very similarly and the M = 2 
clustering was selected. Estimation under the iGMRF mixture model of Equation [5] produces 
the same selected clustering. The left-hand cluster includes a number of states that experienced 
so-called bubbles in property values, such as California, Florida, Hawaii, and Nevada, as well as 
states sensitive to both financial services and auto industries, such as New York and Michigan. 
Table[l]lists the 38 states assigned to the left-hand cluster of Figure [ 6 ] and the 13 states assigned 
to the right-hand cluster. Each panel in Figure [7] presents the fitted latent functions from the 
GP model of Equation [3] with a pink line and the associated credible intervals shaded in gray, 
compared to the actual observed data values. The states represented in the top row of panels are 
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time 


Figure 6: Posterior mean estimated functions, {f}, grouped into clusters. Each panel plots the 
set of functions assigned to a cluster for the CPS state-level dataset under the selected least 
squares clustering. A loess smoother is added in each panel with a red dotted line to highlight 
a pattern among the assigned functions. 


Cluster Member States 


Left AK, AL, AR, CA, CO, CT, DC, FL, GA, HI, IA, IL, IN, ICS, KY, LA, MD, MI, 
MN, MO, MS, ND, NE, NV, NY, OK, OR, PA, RI, TX, UT, VA, VT, WV, WY 
Right AZ, DE, ID, MA, MT, NC, NJ, OH, SC, SD, TN, WA, WI 


Table 1: State Memberships in 2 Estimated Clusters 


assigned to the left-hand cluster of Figure [6j while the states presented in the bottom row are 
assigned to the right-hand cluster. Examining the states assigned to the left-hand cluster, both 
New York and Michigan experienced a relatively more rapid descent in employment, despite the 
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Figure 7: Posterior mean values for estimated functions, {f}, from the mixtures of GPs model, 
represented by the pink lines, with 95% credible intervals displayed as gray shading compared 
to the noisy series, {y}, represented by the black hollow circles, connected with black dashed 
lines. Each panel represents a series for a single state. The state plots in the top row are 
assigned to the left-hand cluster of Figure [6j while the bottom row plots are from the right- 
hand cluster. 


large volatility in New York, while Florida appears to have experienced a large, nearly sudden, 
drop followed by a stabilization and recovery. Contrastingly, the Arizona and Massachusetts 
appear to have rather rapidly bounced up from the initial drop, while Ohio experienced a less 
dramatic fall in employment. The cluster memberships provide state administrators context 
on the common economic drivers for their employment trends through the great recession. 
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Figure 8: Posterior mean estimated function for the state of Nebraska from the mixtures of 
GPs model, represented by the pink line, with 95% credible intervals displayed as gray shading 
compared to the noisy series, {y}, represented by the black hollow circles, connected with 
black dashed lines. 


Lastly, the solid, pink line in Figure [8] displays the de-noised estimated function for the state 
of Nebraska compared to the noisy, CPS estimates (displayed in the hollow circles connected 
with a dashed line) to demonstrate the reduction of noise accomplished under our mixtures of 
GPs model. 

6. Discussion 

Our task was to estimate latent, state-indexed functions that separate away noise-induced 
volatility present in our observed CPS data to improve the efficiency and interpretability of re- 
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suiting county-level employment statistics constructed from the state estimates. We developed 
nonparametric mixture formulations that simultaneously estimate the latent, state-indexed 
functions and allow the data to discover a general dependence structure among them. Our 
simulation study results demonstrated that failing to account for a dependence structure among 
states in the estimation model deteriorates the ability to uncover the true latent functions. 


Our DP mixture of GPs or iGMRFs, outlined in Equations [3] and [5j employ an unsuper¬ 
vised approach for discovering dependence among the state functions based on similarities in 
the time-indexed patterns expressed in the data. They perform well on our CPS employment 
count application, uncovering a differentiation among states based on their employment sen¬ 
sitivities to the great recession. The simulation study revealed a greater robustness, both in 
estimation of accuracy of the latent functions and their clustering properties, of the GP mixture 
model as compared to the iGMRF mixture, due to the regulated smoothness properties of the 
rational quadratic covariance formulation and its employment of more parameters than under 
the iGMRF that control for the trend, scale and frequency properties of the estimated latent 
functions. The iGMRF computes much faster, however, so it may still be useful, particularly 
in the case where the clusters are differentiated based primarily on vertical magnitude of the 
functions. The computational intensity of the GP mixture is mitigated with our adaptation 
of the faster-computing MCMC algorithm of Wang and Neal (2013) (which uses tempered 
transition steps in a temporary, lower-dimensional space) from a single GP construction to our 
nonparametric mixtures of GPs model. 


Our methodology may be extended by employing domain level predictors to help determine 
the distribution over clusterings with an approach such as the probit stick-breaking process of 


Rodriguez and Dunson (2011). This approach would replace the single unknown measure, G, 


imposed on the parameters of the covariance or precision matrix with an uncountable collection 
of unknown measures, {G x } xgffi d, of such measures based on the values for d predictors. The 
collection of measures is induced by indexing the weights of the stick breaking construction of 
Equation [3] with x. This approach may be expected to more precisely estimate the dependence 
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structure in the case of a large number of domains (such as counties, rather than states) on 
the order of 1000 or more. 
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Appendix A. Posterior Sampling Algorithms 
A.l Gaussian Process Sequential Scan 

We sample the parameters of the mixtures of GPs model, A = (@*,s,cq r e ), in a sequential 
scan from the full conditionals after marginalizing over the latent, de-noised functions, {fj}, 
which is easy to do and leads to slightly better mixing. We must, however, co-sample the 
{fj} in the case of intermittent missingness-at-random in {y^} in order to allow the sampling 
of missing values from their posterior predictive distribution. We first outline our sequential 
scan in the case where we marginalize out the latent functions and conclude this section by 
highlighting changes required in the case we co-sample them. The mixtures of GPs model 
specification is highly non-conjugate. We carefully configure blocks of parameters to permit 
application of posterior sampling approaches designed to produce robust chain mixing under 
non-conjugate specifications. 

1. Sample cluster locations, {9^ m }p=i,...,p- m =l (where p denotes parameter type (e.g. 
(0*,#2 j 03)) and m = 1 ,...,M denotes the cluster) and model error precision, r e : We 
sample the posterior distribution for locations in by-cluster groups, {9^m}p=i,...,p, from 
the subset of observations (states) assigned to that cluster because 8* m X 0* , for rn / 
m, a posteriori, in a Metropolis-Hastings scheme using the following log-posterior kernel, 


log vr {e* pm \e*_ pm , s, r e , {yi : s* = m}) 

OC -^n m log (|C T (0* m ) I) - ^ Yl y'i CT yi + («-!) lo §(^pm) - b6 *pm, (6) 

where CT = C (9* m ) + (l/r e ) Ip, s 7 // collects the states assigned to cluster m , Tim 
denotes the number of states assigned to cluster m, and (a, b) are shape and rate hyper- 
parameters of a gamma prior, respectively. This posterior representation is a relatively 
straightforward Gaussian kernel of a non-conjugate probability model. 


We adapt a Metropolis-Hastings algorithm of Wang and Neal (2013) designed to speed 
computation for sampling each 9* m (and also, r e ) by introducing a lower-dimensional 
temporary space where the likelihood (e.g. the T x T, Gaussian process covariance 
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matrix, C) is approximated using a subset of the T time-points. Starting with the 
previously sampled value, xq, (for 9* m or r £ ) from the full-dimensional or exact space, 
the sampling algorithm builds a sequence of proposed transitions by first “stepping up” in 
the temporary space using increasingly coarse, “tempered” transitions, (yZ\. Z 2 ,..., Z n ^j , 
that generate computationally-fast approximations for C. These approximations use 
fewer observations in each step; for example, we apply n = 2 transitions in the lower¬ 
dimensional space and use 100 of the T = 158 time points to formulate Z\ and then 
60 time points for Z 2 . This sequence of coarser transitions is followed by transition 
steps that employ progressively finer distributions (in reverse order), (Z 2 ,Zi) that ’’step 
down” to guide the chain back towards the full dimensional space until we conclude the 
sequence by proposing xo from Z\ to be evaluated in the full-dimensional space. We use 
univariate slice sampling of Neal (2000bj) to accomplish these (reversible) transitions in 
the lower-dimensional space. The proposal (reversible sequence of steps) is then accepted 
with (for n = 2 tempered transitions) probability of move formulation, 

'7ri(x 0 )7r 2 (xi)7ri(xi) 7r(x 0 ) 


mm 


^7ri(xi)7r 2 (xi)7ri(xo) ir(x 0 )J ' 

where “x” denotes the proposals, and “ 7 r” posterior kernel evaluations, for each 9* m 
with subscript 0 pertaining to the exact space and (1,2) to the successively coarser 
transition distributions in the temporary space in the sequential order of application. If 
our lower dimensional approximations are relatively good, this approach will speed chain 
convergence by producing draws of lower autocorrelation since each proposal includes a 
sequence of moves generated in the temporary space. Since the moves in the temporary 


space are executed with fast approximations, Wang and Neal (2013) show that this 
algorithm has the potential to substantially reduce computation time, as compared to 
the usual Metropolis-Hastings algorithm, for drawing an equivalent effective sample size. 
The probability of move formulation evaluates the proposals on the full space of size T, 
however, so that the resulting sampled draws are from the exact posterior distribution, 
rather than from a sparse approximation. 
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Our adaptation configures Wang and Neal (2013) to apply to clusters of Gaussian pro¬ 
cesses (rather than to just a single GP) by exploiting the between cluster posterior 
independence of the {0* m }. 

2. Sample cluster assignments, s = (si,..., sjv): We marginalize over G in Equation [3j that 
results in the Polya urn representation of Blackwell and MacQueen ( |1973[ ), to sample s 
from its full conditionals, 


7T (Si = s|s_j, 0*, a, T e ,Yi) oc < 


A/'t ( yi |0,C r (0*)) if 1 < S < M- 


( 7 ) 


a/c* 
k n—l+a 


N t (y*|o,C T (0s)) if s = M- + h, 


where n_j iS = Yi'^i l(s(0 = s ) is the number of states, excluding state i, assigned to 
cluster s, so that states are assigned to an existing cluster with probability proportional 
to its “popularity”. The posterior assigns an state (through Si) to a new cluster with 
probability proportional to ado under the mixture prior in the case of a conjugate for¬ 
mulation. The conjugate specification requires the likelihood to be integrable in closed 
form with respect to the base distribution, Go, to compute do = f M (y\6....) Go(dd), 
which is not the case under a (nonconjugate) GP construction. So we employ the auxil¬ 


iary Gibbs sampler formulation of Neal (2000a) and sample c* € N locations from base 


distribution, Go, ahead of any assigned observations (e.g. not yet linked to any state), to 
define h = M~ +c* candidate clusters in an augmented space. We then draw s t from this 
augmented space, where any location not assigned states (over a set of draws for s) is 
dropped. This procedure effectively performs a Monte Carlo integration of the likelihood 
with respect to the base distribution. See Savitsky and Vannucci (2010) for a detailed 
example of a DP implementation on a GP. The larger is the tuning parameter, c*, the 
lower is the autocorrelation of the resulting posterior draws (though computation time 
increases because we are sampling with respect to more cluster locations). Good mixing 
is typically achieved with c* set equal to 2 or 3. We note that the number of clusters, 
M , may change in this step as clusters are added and deleted. 
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3. Sample DP concentration parameter, a: We use the algorithm of Escobar and West 


(1995) that formulates the posterior for a as a mixture of two Gamma distributions with 
the mixture component, 77 , drawn from a beta distribution. The algorithm is facilitated 
by the conditional independence of a. from data, Y, given cluster assignments, s, (that 
implies number of clusters, M). 


4. Draw f \ as post-processing step from its posterior predictive distribution: Use the Gaus¬ 
sian joint distribution for (f,;, y,;) to formulate the conditional posterior predictive Gaus¬ 
sian density, 

fz|y» ~ A/'t (mi, Vi) , (8) 

where m, = C (0* ) [CT fa*,)]" 1 * and V t = C (0* ) - C (0* ) [C T (0* (0* ). 

5. In the case of missing data values, we co-sample {f)} (rather than marginalizing over 
them) after the cluster locations in a Gibbs scan from, 

7r(fj|yi,0*,T e ) = N t ( 0 t ” 1 e i , 0 r 1 ) , 

where T x 1, e* = f-iyyj and 0,; = r e ly + C (0* ) *, where we cache the cholesky 
decompostion of C from sampling locations, 0*, for faster computation of C -1 . Under 
co-sampling of the {f,}, we draw the model error precision,r e , in a Gibbs step (instead 
of in the Metropolis-Hastings scheme along with 0*) using, 

7T (r e |Y, (fj) = Q (ai, 61), 

where shape hyperparameter, ai = 0.5TlV-|-a and rate hyperparameter, bi = 0.5 YaLi (y i ~ fj) (y i — fj)+ 
b and (a, b ) are the prior hyperparameter settings. Lastly, we replace y % with under 
co-sampling of f \ in Equations [6] and [7] for sampling cluster locations, 0*, and cluster 
memberships, s. 


A.2 Intrinsic Gaussian Markov Random Field Gibbs Scan 

Posterior sampling from the mixtures of iGMRFs employs a Gibbs scan over A = ({fj}, s, a, r € ), 

from a set of conjugate full conditionals in the following steps: 
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1. Sample latent functions, in a Gibbs steps from the full conditionals, 

K T e, Vij ) = AA , 

where = r e yij + QjjK* s Jij, with Yk:k~j Qjkfik and fcj = r e + QjjK*.. 

2 . Sample locations, {«*„}, in a Gibbs step from, 


* ( K m\{fij}) = Sa(a 2 ,b 2 ), 

with shape parameter, a 2 = \n m (T — oq) + a, where oq = 2 is the rank-deficiency 
of the precision matrix, Q, indicating that that f, provides the equivalent of T — oq 
degrees of freedom, rather than number of time points, T. The rate parameter, b- 2 = 
\Yvsi=m {Yj=iQjj[fi> ~ + b, where the rate parameter is composed from the 

subset of latent functions, { fi}i: Si =m , for those domains assigned to cluster m. 


3. Sample cluster assignments under a similar Polya urn representation as for the GP, only 
here the mixture posterior is conjugate, so, 


Tr(Si = s|s_j, K*, «, T e , fj) oc < 


-l+a IIj=l {fijlfiji [ K Si Qjj\ X ) if 1 < S < M 


(9) 


n— l+a 


do,, 


if 8 = M~ + 1, 


where do,?. = /A/" (v\k, ■ ■ •) Go(dn) = — - — , where a 2 is as dehned above 

t \ a ) b 2 ,i 

and b 2) i is term i in the sum that composes b 2l dehned above. 


Finally, we sample the DP concentration parameter, a, and the model noise precision, r e , in the 
same fashion as for the GP. Unlike the mixtures of GPs, we specify this model in a conjugate 
formulation that allows for fast sampling. 
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